![]() |
The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson Wiley Computer Publishing, John Wiley & Sons, Inc. ISBN: 0471353817 Pub Date: 03/01/99 |
Previous | Table of Contents | Next |
Larger key sizes require more RAM to store the larger keys: 36 bytes for 192-bit keys and 48 bytes for 256-bit keys. If these applications can store key material in ROM or EEPROM, then these key lengths can be implemented on smart cards with only 36 bytes of RAM. All of this RAM can be reused for other purposes between block encryption operations.
For smart cards with larger memory to hold key-dependent data, encryption speed can increase considerably. This is because the round keys can be precomputed as part of the expanded key, requiring a total of 184 bytes of key memory. As shown in Table 5.4, this option nearly halves the encryption time. If the smart card has enough additional memory available to hold 1 Kbyte of precomputed S-box in either RAM, ROM, or EEPROM (for a total of 1208 bytes), performance improves further. Finally, as shown in the final row of Table 5.4, if the entire precomputed S-box plus MDS table can be held in memory (3256 bytes), the speed can again be increased slightly more. It should be noted that some of these large RAM implementations save 512 bytes of code space by assuming that certain tables are not required in ROM, with the entire precomputation being performed instead on the host that sets the key in the smart card. If the smart card has to perform its own key expansion the code size will increase. This increase has its own space/time tradeoff options.
This flexibility makes Twofish well-suited for both small and large smart-card processors: Twofish works in the most RAM-poor environments, while at the same time it is able to take advantage of both moderate-RAM cards and large-RAM cards.
On a 6805 with only 60 bytes of RAM, Twofish encrypts at speeds of 26500 to 37100 clocks per block, depending on the amount of ROM available for the code. On a 4 MHz chip, this translates to 6.6 msec to 9.3 msec per encryption. In these implementations, the key-schedule precomputation time is minimal: slightly over 1750 clocks per key. This setup time could be cut considerably at the cost of two additional 512-byte ROM tables, which would be used during the key schedule.
If ROM is expensive, Twofish can be implemented in less space at slower speeds. The space-speed tradeoffs are of two types: unrolling loops and implementing various lookup tables. By far, the latter has the larger impact on size and speed. For example, Twofishs MDS matrix can be computed in three different ways:
Longer keys are slower, but only slightly so. For the small memory versions, Twofishs encryption time per block increases by less than 2600 clocks per block for 192-bit keys, and by about 5200 clocks per block for 256-bit keys. Similarly, the key schedule precomputation increases to 2550 clocks for 192-bit keys, and to 3400 clocks for 256-bit keys.
As shown in Table 5.4, in smart card CPUs with sufficient additional RAM storage to hold the entire set of subkeys, the throughput improves significantly, although the key setup time also increases. The time savings per block is over 11000 clocks, cutting the block encryption time down to about 15000 clocks; i.e., nearly doubling the encryption speed. The key setup time increases by roughly the same number of clocks, thus making the key setup time comparable to a single block encryption. This approach also cuts down the code size by a few hundred bytes. It should be noted further that, in fixed-key environments, the subkeys can be stored along with the key bytes in EEPROM, cutting the total RAM usage down to 36 bytes while maintaining the higher speed.
As another tradeoff, if another 1 Kbyte of RAM or EEPROM is available, all four 8-bit S-boxes can be precomputed. Clearly, this approach has relatively low key agility, but the time required to encrypt a block decreases by roughly 6000 clocks. When combined with precomputed subkeys as discussed in the previous paragraph, the block encryption time drops to about 12000 clocks, nearly three times the best speed for low RAM implementations. In most cases, this approach would be used only where the key is fixed, but it does allow for very high throughput. Similarly, if 3 Kbytes of RAM or EEPROM is available for tables, throughput can be further improved slightly.
The wide variety of possible speeds again illustrates Twofishs flexibility in these constrained environments. The algorithm does not have one speed; it has many speeds, depending on available resources.
Twofish code is very compact: 1760 to 2200 bytes for minimal RAM footprint, depending on the implementation. The same code base can be used for both encryption and decryption. If only encryption is required, minor improvements in code size can be obtained (on the order of 150 bytes). The extra code required for larger keys is fairly negligible: less than 100 extra bytes for a 192-bit key, and less than 200 bytes for a 256-bit key.
Observe that it is possible to save further ROM space by computing q0 and q1 lookups using the underlying 4-bit construction, as specified in Section 4.3.5. Such a scheme would replace 512 bytes of ROM table with 64 bytes of ROM and a small subroutine to compute the full 8-bit q0 and q1, saving perhaps 350 bytes of ROM; unfortunately, encryption speed would decrease by a factor of ten or more. Thus, this technique is only of interest in smart card applications for which ROM size is extremely critical but performance is not. Nonetheless, such an approach illustrates the implementation flexibility afforded by Twofish.
The 64-bit Alpha 21164 CPU can run up to 600 MHz using only a 0.35 micron CMOS process, compared to the 0.25 micron technology used in a Pentium II. The Alpha is widely regarded as the fastest general purpose processor available today. Its architecture and performance are expected to remain at the leading edge of technology for the foreseeable future. It has a 4-way superscalar architecture, which is fairly close in many respects to a Pentium II. Twofish should run on an Alpha in roughly the same number of clocks as on a Pentium Pro (i.e., 300).
Previous | Table of Contents | Next |